CLASS - data from Neter, Wasserman and Kutner, p. 1122 (nested design).
Average test SCOREs in classes taught by six different INSTRUCTors at three different SCHOOLs. Instructor is nested within school, and class is nested within instructor.
Load and prepare data
class = read.table("../data/class.txt", header=TRUE);
class;
## school instruct score
## 1 1 1 25
## 2 1 1 29
## 3 1 2 14
## 4 1 2 11
## 5 2 1 11
## 6 2 1 6
## 7 2 2 22
## 8 2 2 18
## 9 3 1 17
## 10 3 1 20
## 11 3 2 5
## 12 3 2 2
attach(class);
school.f = as.factor(school);
instruct.f = as.factor(instruct);
## Create a new vector listing all school and instructor combinations
f = school.f:instruct.f
## Do a boxplot
boxplot(score ~ f, xlab="school:instructor", ylab="score")
The mathematical model: \[ y_{ijk} = \mu + a_i + b_{j(i)} + c_{k(ij)},\] \(i = 1, \dots, a\), \(j = 1,\dots, b\) and \(k = 1,\dots, c\), where there are
In this example, \(a_i\) is the school effect, \(b_j\) is the instructor effect, \(c_k\) corresponds to the individual students. Note that if \(A\) were a fixed-effect term, then it would not have a distribution.
Note that instructors are labeled 1 and 2 in each school, even though they are not the same instructors across schools! We need to specify the nesting in the model expression, below, using the “%in%” symbol:
m1 = lm(score ~ school.f + instruct.f %in% school.f)
anova(m1)
## Analysis of Variance Table
##
## Response: score
## Df Sum Sq Mean Sq F value Pr(>F)
## school.f 2 156.5 78.25 11.179 0.009473 **
## school.f:instruct.f 3 567.5 189.17 27.024 0.000697 ***
## Residuals 6 42.0 7.00
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note that this F-statistic and P-value are wrong, assuming we are modeling school and instructor as random effects. It should be:
\[ F = 78.25/189.17 = 0.4136491, \] \[ p = 1 - pf(0.4136491, 2, 3) = 0.6939744.\]
\[ F = 78.25/189.17 = 0.414, P = Pr(F(2,3) > 0.414) = 1 - pf(0.414, 2, 3) = 0.69. \]
See Tables 7.8 and 7.10 on text book pages 246 and 248 (attached below). These tables suggest if factor B is random, the test for factor A should use \(MS_{B(A)}\) as the denominator.
The tables below is from Montgomery Design and Analysis of Experiments, Douglas C. Montgomery, 7th and 8th Edition. The authors there used \(n\) in places where we use \(c\) (and the subscriptsa are different). Otherwise, they are the same as tables above.
The above tables also tell us how to estiamte the variance components. For example, when both factors are random (like in this example),
## Extract MS
aa = anova(m1);
ms = aa$`Mean Sq`;
## Number of levels for each factor
a = 3;
b = 2;
c = 2;
Variance components corresponding to school: \[ \hat\sigma_a^2 = (MS_A - MS_{B(A)})/(b \times c) = (78.25 - 189.1666667)/(2 \times 2) = -27.7291667 < 0 !\] Variance components corresponding to instructor (nested in school): \[ \hat\sigma^2_{b(a)} = (MS_{B(A)}-MS_{C(B)})/ c = (189.1666667 - 7)/2 =91.0833333 \] Variance components corresponding to class (nested in instructor): \[ \hat\sigma^2_{c(b)} = MS_{c(b)} = 7 \]
We can futher calculate the percentages of each component (e.g., the instructor component accounts for \(91/(91+7) = 0.9285714\) of the total variance).
Essentially, when we tell R that instructors are nested within classes, R will relabel the instuctors within different classes. We can also relabel the instructors by hand.
inst.relabel = factor(rep(1:6, each=2));
m3 = lm(score ~ school.f + inst.relabel);
anova(m3);
## Analysis of Variance Table
##
## Response: score
## Df Sum Sq Mean Sq F value Pr(>F)
## school.f 2 156.5 78.25 11.179 0.009473 **
## inst.relabel 3 567.5 189.17 27.024 0.000697 ***
## Residuals 6 42.0 7.00
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Compare the above ANOVA with the earlier one.
## Note that after relabeling, even if we ask R to "cross" school with instructors, R will not do it: since each instructor is only under one class.
m4 = lm(score ~ school.f * inst.relabel);
anova(m4);
## Analysis of Variance Table
##
## Response: score
## Df Sum Sq Mean Sq F value Pr(>F)
## school.f 2 156.5 78.25 11.179 0.009473 **
## inst.relabel 3 567.5 189.17 27.024 0.000697 ***
## Residuals 6 42.0 7.00
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
boxplot(score ~ inst.relabel, xlab="instuctor", ylab="score")
If you’re not sure how to specify the nesting, you can always do a two-way ANOVA and add the SS for the main effect of instructor and for the interaction between school and instructor to get the SS for instructor nested within school.
m2 = lm(score ~ school.f*instruct.f);
anova(m2)
## Analysis of Variance Table
##
## Response: score
## Df Sum Sq Mean Sq F value Pr(>F)
## school.f 2 156.5 78.25 11.179 0.0094725 **
## instruct.f 1 108.0 108.00 15.429 0.0077312 **
## school.f:instruct.f 2 459.5 229.75 32.821 0.0005874 ***
## Residuals 6 42.0 7.00
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this example, adding \(108.0\) and \(459.5\) will give the correct \(567.5\) (with 3 d.f.).
If we are only interested in the school effect, we can use the mean score from each instructor. We will get the correct F-test (assuming instuctor is a random effect factor). Compare the following ANOVA table to the ANOVA table earlier. (Remember we said the same thing whne discussing subsampling.) Obviously, we can no longer estimate all the variance components once the data are collapsed.
score.1 = tapply(score, inst.relabel, mean);
school.1 = factor(rep(1:3, each=2));
m5 = lm(score.1 ~ school.1);
anova(m5);
## Analysis of Variance Table
##
## Response: score.1
## Df Sum Sq Mean Sq F value Pr(>F)
## school.1 2 78.25 39.125 0.4137 0.694
## Residuals 3 283.75 94.583